Inferring Correlation Networks from Genomic Survey Data
Authors
Abstract
High-throughput sequencing-based techniques, such as 16S rRNA gene profiling, have the potential to elucidate the complex inner workings of natural microbial communities, whether from the world's oceans or the human gut. A key step in exploring such data is the identification of dependencies between members of these communities, which is commonly achieved by correlation analysis. However, it has been known since the days of Karl Pearson that analyzing the type of data these techniques generate (referred to as compositional data) can produce unreliable results, because the observed data take the form of relative fractions of genes or species rather than their absolute abundances. Using simulated data and real data from the Human Microbiome Project, we show that such compositional effects can be widespread and severe: in some real data sets many of the correlations among taxa can be artifactual, and true correlations may even appear with the opposite sign. Additionally, we show that community diversity is the key factor that modulates the severity of such compositional effects, and we develop a new approach, called SparCC (available at https://bitbucket.org/yonatanf/sparcc), which is capable of estimating correlation values from compositional data. To illustrate a potential application of SparCC, we infer a rich ecological network connecting hundreds of interacting species across 18 sites on the human body. Using the SparCC network as a reference, we estimate that the standard approach yields three spurious species-species interactions for each true interaction and misses 60% of the true interactions in the human microbiome data, and, as predicted, most of the erroneous links are found in the samples with the lowest diversity.
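The compositional effect described in the abstract can be illustrated with a small simulation. The sketch below is not SparCC and is not taken from the paper. It assumes independent log-normal absolute abundances (so every true correlation is zero), converts them to relative fractions, and compares the resulting Pearson correlations for a low-diversity and a high-diversity community. The function names, the number of taxa, and the noise level `sigma` are all hypothetical choices made for illustration.

```python
import numpy as np


def simulate_fractions(mean_abundances, n_samples=2000, sigma=0.5, seed=0):
    """Draw independent absolute abundances and convert them to fractions.

    Hypothetical helper: taxa are independent by construction, so any
    correlation seen among the fractions is a compositional artifact.
    """
    rng = np.random.default_rng(seed)
    means = np.asarray(mean_abundances, dtype=float)
    noise = rng.lognormal(mean=0.0, sigma=sigma, size=(n_samples, means.size))
    absolute = means * noise
    # Normalizing each sample to sum to 1 mimics relative-abundance data.
    fractions = absolute / absolute.sum(axis=1, keepdims=True)
    return absolute, fractions


def mean_abs_offdiag_corr(data):
    """Mean absolute Pearson correlation over all distinct taxon pairs."""
    corr = np.corrcoef(data, rowvar=False)
    off_diagonal = corr[~np.eye(corr.shape[0], dtype=bool)]
    return float(np.abs(off_diagonal).mean())


# Low diversity: 3 equally abundant taxa. High diversity: 50 taxa.
abs_low, frac_low = simulate_fractions(np.ones(3))
abs_high, frac_high = simulate_fractions(np.ones(50))

print("low diversity,  absolute :", mean_abs_offdiag_corr(abs_low))
print("low diversity,  fractions:", mean_abs_offdiag_corr(frac_low))
print("high diversity, absolute :", mean_abs_offdiag_corr(abs_high))
print("high diversity, fractions:", mean_abs_offdiag_corr(frac_high))
```

Under these assumptions the absolute abundances show near-zero correlations in both communities, while the fractions of the three-taxon community are strongly negatively correlated simply because they must sum to one. With many taxa of similar abundance the constraint is spread over many components and the artifact shrinks, which is consistent with the abstract's claim that diversity modulates the effect.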
Related resources
Inferring Gene Dependency Networks from Genomic Longitudinal Data: a Functional Data Approach
A key aim of systems biology is to unravel the regulatory interactions among genes and gene products in a cell. Here we investigate a graphical model that treats the observed gene expression over time as realizations of random curves. This approach is centered around an estimator of dynamical pairwise correlation that takes account of the functional nature of the observed data. This allows to...
A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics.
Inferring large-scale covariance matrices from sparse genomic data is a ubiquitous problem in bioinformatics. Clearly, the widely used standard covariance and correlation estimators are ill-suited for this purpose. As a statistically efficient and computationally fast alternative we propose a novel shrinkage covariance estimator that exploits the Ledoit-Wolf (2003) lemma for analytic calculation...
The Network Architecture of the Saccharomyces cerevisiae Genome
We propose a network-based approach for surmising the spatial organization of genomes from high-throughput interaction data. Our strategy is based on methods for inferring architectural features of networks. Specifically, we employ a community detection algorithm to partition networks of genomic interactions. These community partitions represent an intuitive interpretation of genomic organizati...
Small-Sample Analysis and Inference of Networked Dependency Structures from Complex Genomic Data
plications in Genetics and Molecular Biology 4: Article 32. Juliane Schäfer und Korbinian Strimmer. 2005. An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics 21:754–764. Juliane Schäfer und Korbinian Strimmer. 2005. Learning large-scale graphical Gaussian models from genomic data. Summary The present work is concerned with modeling and inferring geneti...
Inferring Weighted Directed Association Networks from Multivariate Time Series with the Small-Shuffle Symbolic Transfer Entropy Spectrum Method
Complex network methodology is very useful for complex system exploration. However, the relationships among variables in complex systems are usually not clear. Therefore, inferring association networks among variables from their observed data has been a popular research topic. We propose a method, named small-shuffle symbolic transfer entropy spectrum (SSSTES), for inferring association network...